{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "lang": "fr"
   },
   "source": [
    "<center>\n",
    "<img src=\"https://raw.githubusercontent.com/Yorko/mlcourse.ai/master/img/ods_stickers.jpg\" />\n",
    "    \n",
    "## [mlcourse.ai](https://mlcourse.ai) - Open Machine Learning Course\n",
    "\n",
    "<center>\n",
    "Auteur: [Yury Kashnitskiy](https://yorko.github.io). Traduit par Anna Larionova et [Ousmane Cissé](https://fr.linkedin.com/in/ousmane-cisse).  \n",
    "Ce matériel est soumis aux termes et conditions de la licence [Creative Commons CC BY-NC-SA 4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/).  \n",
    "L'utilisation gratuite est autorisée à des fins non commerciales."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "lang": "fr"
   },
   "source": [
    "# <center> Sujet 6. Régression</center>\n",
    "## <center>Lasso et Regression Ridge</center>\n",
    "\n",
    "*Le programme du cours diffère cette semaine du plan de l'article, car le sujet 4 (modèles linéaires) est trop vaste et important, nous couvrons donc la régression cette semaine.*"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-05-07T09:53:29.804503Z",
     "start_time": "2020-05-07T09:52:27.566842Z"
    }
   },
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import pandas as pd\n",
    "from matplotlib import pyplot as plt\n",
    "\n",
    "%config InlineBackend.figure_format = 'retina'\n",
    "import seaborn as sns\n",
    "\n",
    "sns.set()  # just to use the seaborn theme\n",
    "\n",
    "\n",
    "from sklearn.datasets import load_boston\n",
    "from sklearn.linear_model import Lasso, LassoCV, Ridge, RidgeCV\n",
    "from sklearn.model_selection import KFold, cross_val_score"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "lang": "fr"
   },
   "source": [
    "**Nous travaillerons avec les données sur les prix des logements à Boston (référentiel UCI).**\n",
    "**Téléchargez les données.**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-05-07T09:53:31.643705Z",
     "start_time": "2020-05-07T09:51:43.174Z"
    }
   },
   "outputs": [],
   "source": [
    "boston = load_boston()\n",
    "X, y = boston[\"data\"], boston[\"target\"]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "lang": "fr"
   },
   "source": [
    "**Description des données:**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(boston.DESCR)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "boston.feature_names"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "lang": "fr"
   },
   "source": [
    "**Regardons les deux premiers enregistrements.**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "X[:2]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "lang": "fr"
   },
   "source": [
    "## Régression Lasso"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "lang": "fr"
   },
   "source": [
    "La régression Lasso minimise l'erreur quadratique moyenne avec la régularisation L1:\n",
    "$$\\Large error(X, y, w) = \\frac{1}{2} \\sum_{i=1}^\\ell {(y_i - w^Tx_i)}^2 + \\alpha \\sum_{i=1}^d |w_i|$$\n",
    "\n",
    "où l'équation d'hyperplan $y = w^Tx$ en fonction des paramètres du modèle $w$, $\\ell$ est le nombre d'observations dans les données $X$, $d$ est le nombre d'entités, valeurs cibles $y$, coefficient de régularisation $\\alpha$."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "lang": "fr"
   },
   "source": [
    "**Ajustons la régression Lasso avec le petit coefficient $\\alpha$ (régularisation faible). Le coefficient lié à la caractéristique NOX (concentration en oxydes nitriques) sera nul. Cela signifie que cette caractéristique est la moins importante pour la prévision des prix médians des logements dans cette région.**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "lasso = Lasso(alpha=0.1)\n",
    "lasso.fit(X, y)\n",
    "lasso.coef_"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "lang": "fr"
   },
   "source": [
    "**Entraînons la régression Lasso avec $\\alpha=10$. Tous les coefficients sont égaux à zéro, à l'exception des caractéristiques ZN (proportion de terrains résidentiels zonés pour des lots de plus de 25 000 pieds carrés), TAXE (taux de taxe foncière de pleine valeur), B (proportion de Noirs par ville) et LSTAT (% de statut inférieur de la population).**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "lasso = Lasso(alpha=10)\n",
    "lasso.fit(X, y)\n",
    "lasso.coef_"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "lang": "fr"
   },
   "source": [
    "**Cela signifie que la régression Lasso peut servir de méthode de sélection des caractéristiques.**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "n_alphas = 200\n",
    "alphas = np.linspace(0.1, 10, n_alphas)\n",
    "model = Lasso()\n",
    "\n",
    "coefs = []\n",
    "for a in alphas:\n",
    "    model.set_params(alpha=a)\n",
    "    model.fit(X, y)\n",
    "    coefs.append(model.coef_)\n",
    "\n",
    "plt.rcParams[\"figure.figsize\"] = (12, 8)\n",
    "\n",
    "ax = plt.gca()\n",
    "# ax.set_color_cycle(['b', 'r', 'g', 'c', 'k', 'y', 'm'])\n",
    "\n",
    "ax.plot(alphas, coefs)\n",
    "ax.set_xscale(\"log\")\n",
    "ax.set_xlim(ax.get_xlim()[::-1])  # reverse axis\n",
    "plt.xlabel(\"alpha\")\n",
    "plt.ylabel(\"weights\")\n",
    "plt.title(\"Lasso coefficients as a function of the regularization\")\n",
    "plt.axis(\"tight\")\n",
    "plt.show();"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "lang": "fr"
   },
   "source": [
    "**Maintenant, trouvons la meilleure valeur de $\\alpha$ lors de la validation croisée.**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "lasso_cv = LassoCV(alphas=alphas, cv=3, random_state=17)\n",
    "lasso_cv.fit(X, y)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "lasso_cv.coef_"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "lasso_cv.alpha_"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "lang": "fr"
   },
   "source": [
    "**Dans Scikit-learn, les métriques sont généralement * maximisées *, donc pour MSE il y a une solution: `neg_mean_squared_error` est plutôt minimisé. Pas vraiment pratique.**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "cross_val_score(Lasso(lasso_cv.alpha_), X, y, cv=3, scoring=\"neg_mean_squared_error\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "abs(\n",
    "    cross_val_score(\n",
    "        Lasso(lasso_cv.alpha_), X, y, cv=3, scoring=\"neg_mean_squared_error\"\n",
    "    ).mean()\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "abs(np.mean(cross_val_score(Lasso(9.95), X, y, cv=3, scoring=\"neg_mean_squared_error\")))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "lang": "fr"
   },
   "source": [
    "**Encore un point ambigu: LassoCV trie les valeurs des paramètres par ordre décroissant pour faciliter l'optimisation. Il peut sembler que l’optimisation $\\alpha$ ne fonctionne pas correctement.**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "lasso_cv.alphas[:10]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "lasso_cv.alphas_[:10]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "plt.plot(lasso_cv.alphas, lasso_cv.mse_path_.mean(1))  # incorrect\n",
    "plt.axvline(lasso_cv.alpha_, c=\"g\");"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "plt.plot(lasso_cv.alphas_, lasso_cv.mse_path_.mean(1))  # correct\n",
    "plt.axvline(lasso_cv.alpha_, c=\"g\");"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "lang": "fr"
   },
   "source": [
    "## Régression Ridge"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "lang": "fr"
   },
   "source": [
    "La régression Ridge minimise l'erreur quadratique moyenne avec la régularisation L2:\n",
    "$$\\Large error(X, y, w) = \\frac{1}{2} \\sum_{i=1}^\\ell {(y_i - w^Tx_i)}^2 + \\alpha \\sum_{i=1}^d w_i^2$$\n",
    "\n",
    "où l'équation d'hyperplan $y = w^Tx$ en fonction des paramètres du modèle $w$, $\\ell$ est le nombre d'observations dans les données $X$, $d$ est le nombre d'entités, valeurs cibles $y$, coefficient de régularisation $\\alpha$."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "lang": "fr"
   },
   "source": [
    "Il existe une classe spéciale [RidgeCV](http://scikit-learn.org/stable/modules/generated/sklearn.linear _model.RidgeCV.html # sklearn.linear_ model.RidgeCV) pour la régression Ridge validation croisée."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "n_alphas = 200\n",
    "ridge_alphas = np.logspace(-2, 6, n_alphas)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "ridge_cv = RidgeCV(alphas=ridge_alphas, scoring=\"neg_mean_squared_error\", cv=3)\n",
    "ridge_cv.fit(X, y)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "ridge_cv.alpha_"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "lang": "fr"
   },
   "source": [
    "**En cas de régression Ridge, aucun des paramètres ne se réduit à zéro. La valeur peut être petite mais non nulle.**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "ridge_cv.coef_"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "n_alphas = 200\n",
    "ridge_alphas = np.logspace(-2, 6, n_alphas)\n",
    "model = Ridge()\n",
    "\n",
    "coefs = []\n",
    "for a in ridge_alphas:\n",
    "    model.set_params(alpha=a)\n",
    "    model.fit(X, y)\n",
    "    coefs.append(model.coef_)\n",
    "\n",
    "ax = plt.gca()\n",
    "# ax.set_color_cycle(['b', 'r', 'g', 'c', 'k', 'y', 'm'])\n",
    "\n",
    "ax.plot(ridge_alphas, coefs)\n",
    "ax.set_xscale(\"log\")\n",
    "ax.set_xlim(ax.get_xlim()[::-1])  # reverse axis\n",
    "plt.xlabel(\"alpha\")\n",
    "plt.ylabel(\"weights\")\n",
    "plt.title(\"Ridge coefficients as a function of the regularization\")\n",
    "plt.axis(\"tight\")\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "lang": "fr"
   },
   "source": [
    "## Références\n",
    "- [Generalized linear models](http://scikit-learn.org/stable/modules/linear_model.html) (Generalized Linear Models, GLM) in Scikit-learn\n",
    "- [LinearRegression](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html#sklearn.linear_model.LinearRegression), [Lasso](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Lasso.html#sklearn.linear_model.Lasso), [LassoCV](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LassoCV.html#sklearn.linear_model.LassoCV), [Ridge](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Ridge.html) and [RidgeCV](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.RidgeCV.html#sklearn.linear_model.RidgeCV) in Scikit-learn"
   ]
  }
 ],
 "metadata": {
  "anaconda-cloud": {},
  "hide_input": false,
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.5.4"
  },
  "name": "lesson8_part1_kmeans.ipynb",
  "nbTranslate": {
   "displayLangs": [
    "*"
   ],
   "hotkey": "alt-t",
   "langInMainMenu": true,
   "sourceLang": "en",
   "targetLang": "fr",
   "useGoogleTranslate": true
  },
  "toc": {
   "base_numbering": 1,
   "nav_menu": {},
   "number_sections": false,
   "sideBar": true,
   "skip_h1_title": false,
   "title_cell": "Table of Contents",
   "title_sidebar": "Contents",
   "toc_cell": false,
   "toc_position": {},
   "toc_section_display": true,
   "toc_window_display": false
  },
  "varInspector": {
   "cols": {
    "lenName": 16,
    "lenType": 16,
    "lenVar": 40
   },
   "kernels_config": {
    "python": {
     "delete_cmd_postfix": "",
     "delete_cmd_prefix": "del ",
     "library": "var_list.py",
     "varRefreshCmd": "print(var_dic_list())"
    },
    "r": {
     "delete_cmd_postfix": ") ",
     "delete_cmd_prefix": "rm(",
     "library": "var_list.r",
     "varRefreshCmd": "cat(var_dic_list()) "
    }
   },
   "types_to_exclude": [
    "module",
    "function",
    "builtin_function_or_method",
    "instance",
    "_Feature"
   ],
   "window_display": false
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}
